{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "bef41716",
   "metadata": {
    "origin_pos": 0
   },
   "source": [
    "# Generalization\n",
    ":label:`sec_generalization_basics`\n",
    "\n",
    "Consider two college students diligently\n",
    "preparing for their final exam.\n",
    "Commonly, this preparation will consist\n",
    "of practicing and testing their abilities\n",
    "by taking exams administered in previous years.\n",
    "Nonetheless, doing well on past exams is no guarantee\n",
    "that they will excel when it matters.\n",
    "For instance, imagine a student, Extraordinary Ellie,\n",
    "whose preparation consisted entirely\n",
    "of memorizing the answers\n",
    "to previous years' exam questions.\n",
    "Even if Ellie were endowed\n",
    "with an extraordinary memory,\n",
    "and thus could perfectly recall the answer\n",
    "to any *previously seen* question,\n",
    "she might nevertheless freeze\n",
    "when faced with a new (*previously unseen*) question.\n",
    "By comparison, imagine another student,\n",
    "Inductive Irene, with comparably poor\n",
    "memorization skills,\n",
    "but a knack for picking up patterns.\n",
    "Note that if the exam truly consisted of\n",
    "recycled questions from a previous year,\n",
    "Ellie would handily outperform Irene.\n",
    "Even if Irene's inferred patterns\n",
    "yielded 90% accurate predictions,\n",
    "they could never compete with\n",
    "Ellie's 100% recall.\n",
    "However, even if the exam consisted\n",
    "entirely of fresh questions,\n",
    "Irene might maintain her 90% average.\n",
    "\n",
    "As machine learning scientists,\n",
    "our goal is to discover *patterns*.\n",
    "But how can we be sure that we have\n",
    "truly discovered a *general* pattern\n",
    "and not simply memorized our data?\n",
    "Most of the time, our predictions are only useful\n",
    "if our model discovers such a pattern.\n",
    "We do not want to predict yesterday's stock prices, but tomorrow's.\n",
    "We do not need to recognize\n",
    "already diagnosed diseases\n",
    "for previously seen patients,\n",
    "but rather previously undiagnosed\n",
    "ailments in previously unseen patients.\n",
    "This problem---how to discover patterns that *generalize*---is\n",
    "the fundamental problem of machine learning,\n",
    "and arguably of all of statistics.\n",
    "We might cast this problem as just one slice\n",
    "of a far grander question\n",
    "that engulfs all of science:\n",
    "when are we ever justified\n",
    "in making the leap from particular observations\n",
    "to more general statements?\n",
    "\n",
    "\n",
    "In real life, we must fit our models\n",
    "using a finite collection of data.\n",
    "The typical scales of that data\n",
    "vary wildly across domains.\n",
    "For many important medical problems,\n",
    "we can only access a few thousand data points.\n",
    "When studying rare diseases,\n",
    "we might be lucky to access hundreds.\n",
    "By contrast, the largest public datasets\n",
    "consisting of labeled photographs,\n",
    "e.g., ImageNet :cite:`Deng.Dong.Socher.ea.2009`,\n",
    "contain millions of images.\n",
    "And some unlabeled image collections\n",
    "such as the Flickr YFC100M dataset\n",
    "can be even larger, containing\n",
    "over 100 million images :cite:`thomee2016yfcc100m`.\n",
    "However, even at this extreme scale,\n",
    "the number of available data points\n",
    "remains infinitesimally small\n",
    "compared to the space of all possible images\n",
    "at a megapixel resolution.\n",
    "Whenever we work with finite samples,\n",
    "we must keep in mind the risk\n",
    "that we might fit our training data,\n",
    "only to discover that we failed\n",
    "to discover a generalizable pattern.\n",
    "\n",
    "The phenomenon of fitting closer to our training data\n",
    "than to the underlying distribution is called *overfitting*,\n",
    "and techniques for combatting overfitting\n",
    "are often called *regularization* methods.\n",
    "While it is no substitute for a proper introduction\n",
    "to statistical learning theory (see :citet:`Vapnik98,boucheron2005theory`),\n",
    "we will give you just enough intuition to get going.\n",
    "We will revisit generalization in many chapters\n",
    "throughout the book,\n",
    "exploring both what is known about\n",
    "the principles underlying generalization\n",
    "in various models,\n",
    "and also heuristic techniques\n",
    "that have been found (empirically)\n",
    "to yield improved generalization\n",
    "on tasks of practical interest.\n",
    "\n",
    "\n",
    "\n",
    "## Training Error and Generalization Error\n",
    "\n",
    "\n",
    "In the standard supervised learning setting,\n",
    "we assume that the training data and the test data\n",
    "are drawn *independently* from *identical* distributions.\n",
    "This is commonly called the *IID assumption*.\n",
    "While this assumption is strong,\n",
    "it is worth noting that, absent any such assumption,\n",
    "we would be dead in the water.\n",
    "Why should we believe that training data\n",
    "sampled from distribution $P(X,Y)$\n",
    "should tell us how to make predictions on\n",
    "test data generated by a *different distribution* $Q(X,Y)$?\n",
    "Making such leaps turns out to require\n",
    "strong assumptions about how $P$ and $Q$ are related.\n",
    "Later on we will discuss some assumptions\n",
    "that allow for shifts in distribution\n",
    "but first we need to understand the IID case,\n",
    "where $P(\\cdot) = Q(\\cdot)$.\n",
    "\n",
    "To begin with, we need to differentiate between\n",
    "the *training error* $R_\\textrm{emp}$,\n",
    "which is a *statistic*\n",
    "calculated on the training dataset,\n",
    "and the *generalization error* $R$,\n",
    "which is an *expectation* taken\n",
    "with respect to the underlying distribution.\n",
    "You can think of the generalization error as\n",
    "what you would see  if you applied your model\n",
    "to an infinite stream of additional data examples\n",
    "drawn from the same underlying data distribution.\n",
    "Formally the training error is expressed as a *sum* (with the same notation as :numref:`sec_linear_regression`):\n",
    "\n",
    "$$R_\\textrm{emp}[\\mathbf{X}, \\mathbf{y}, f] = \\frac{1}{n} \\sum_{i=1}^n l(\\mathbf{x}^{(i)}, y^{(i)}, f(\\mathbf{x}^{(i)})),$$\n",
    "\n",
    "\n",
    "while the generalization error is expressed as an integral:\n",
    "\n",
    "$$R[p, f] = E_{(\\mathbf{x}, y) \\sim P} [l(\\mathbf{x}, y, f(\\mathbf{x}))] =\n",
    "\\int \\int l(\\mathbf{x}, y, f(\\mathbf{x})) p(\\mathbf{x}, y) \\;d\\mathbf{x} dy.$$\n",
    "\n",
    "Problematically, we can never calculate\n",
    "the generalization error $R$ exactly.\n",
    "Nobody ever tells us the precise form\n",
    "of the density function $p(\\mathbf{x}, y)$.\n",
    "Moreover, we cannot sample an infinite stream of data points.\n",
    "Thus, in practice, we must *estimate* the generalization error\n",
    "by applying our model to an independent test set\n",
    "constituted of a random selection of examples\n",
    "$\\mathbf{X}'$ and labels $\\mathbf{y}'$\n",
    "that were withheld from our training set.\n",
    "This consists of applying the same formula\n",
    "that was used for calculating the empirical training error\n",
    "but to a test set $\\mathbf{X}', \\mathbf{y}'$.\n",
    "\n",
    "\n",
    "Crucially, when we evaluate our classifier on the test set,\n",
    "we are working with a *fixed* classifier\n",
    "(it does not depend on the sample of the test set),\n",
    "and thus estimating its error\n",
    "is simply the problem of mean estimation.\n",
    "However the same cannot be said\n",
    "for the training set.\n",
    "Note that the model we wind up with\n",
    "depends explicitly on the selection of the training set\n",
    "and thus the training error will in general\n",
    "be a biased estimate of the true error\n",
    "on the underlying population.\n",
    "The central question of generalization\n",
    "is then when should we expect our training error\n",
    "to be close to the population error\n",
    "(and thus the generalization error).\n",
    "\n",
    "### Model Complexity\n",
    "\n",
    "In classical theory, when we have\n",
    "simple models and abundant data,\n",
    "the training and generalization errors tend to be close.\n",
    "However, when we work with\n",
    "more complex models and/or fewer examples,\n",
    "we expect the training error to go down\n",
    "but the generalization gap to grow.\n",
    "This should not be surprising.\n",
    "Imagine a model class so expressive that\n",
    "for any dataset of $n$ examples,\n",
    "we can find a set of parameters\n",
    "that can perfectly fit arbitrary labels,\n",
    "even if randomly assigned.\n",
    "In this case, even if we fit our training data perfectly,\n",
    "how can we conclude anything about the generalization error?\n",
    "For all we know, our generalization error\n",
    "might be no better than random guessing.\n",
    "\n",
    "In general, absent any restriction on our model class,\n",
    "we cannot conclude, based on fitting the training data alone,\n",
    "that our model has discovered any generalizable pattern :cite:`vapnik1994measuring`.\n",
    "On the other hand, if our model class\n",
    "was not capable of fitting arbitrary labels,\n",
    "then it must have discovered a pattern.\n",
    "Learning-theoretic ideas about model complexity\n",
    "derived some inspiration from the ideas\n",
    "of Karl Popper, an influential philosopher of science,\n",
    "who formalized the criterion of falsifiability.\n",
    "According to Popper, a theory\n",
    "that can explain any and all observations\n",
    "is not a scientific theory at all!\n",
    "After all, what has it told us about the world\n",
    "if it has not ruled out any possibility?\n",
    "In short, what we want is a hypothesis\n",
    "that *could not* explain any observations\n",
    "we might conceivably make\n",
    "and yet nevertheless happens to be compatible\n",
    "with those observations that we *in fact* make.\n",
    "\n",
    "Now what precisely constitutes an appropriate\n",
    "notion of model complexity is a complex matter.\n",
    "Often, models with more parameters\n",
    "are able to fit a greater number\n",
    "of arbitrarily assigned labels.\n",
    "However, this is not necessarily true.\n",
    "For instance, kernel methods operate in spaces\n",
    "with infinite numbers of parameters,\n",
    "yet their complexity is controlled\n",
    "by other means :cite:`Scholkopf.Smola.2002`.\n",
    "One notion of complexity that often proves useful\n",
    "is the range of values that the parameters can take.\n",
    "Here, a model whose parameters are permitted\n",
    "to take arbitrary values\n",
    "would be more complex.\n",
    "We will revisit this idea in the next section,\n",
    "when we introduce *weight decay*,\n",
    "your first practical regularization technique.\n",
    "Notably, it can be difficult to compare\n",
    "complexity among members of substantially different model classes\n",
    "(say, decision trees vs. neural networks).\n",
    "\n",
    "\n",
    "At this point, we must stress another important point\n",
    "that we will revisit when introducing deep neural networks.\n",
    "When a model is capable of fitting arbitrary labels,\n",
    "low training error does not necessarily\n",
    "imply low generalization error.\n",
    "*However, it does not necessarily\n",
    "imply high generalization error either!*\n",
    "All we can say with confidence is that\n",
    "low training error alone is not enough\n",
    "to certify low generalization error.\n",
    "Deep neural networks turn out to be just such models:\n",
    "while they generalize well in practice,\n",
    "they are too powerful to allow us to conclude\n",
    "much on the basis of training error alone.\n",
    "In these cases we must rely more heavily\n",
    "on our holdout data to certify generalization\n",
    "after the fact.\n",
    "Error on the holdout data, i.e., validation set,\n",
    "is called the *validation error*.\n",
    "\n",
    "## Underfitting or Overfitting?\n",
    "\n",
    "When we compare the training and validation errors,\n",
    "we want to be mindful of two common situations.\n",
    "First, we want to watch out for cases\n",
    "when our training error and validation error are both substantial\n",
    "but there is a little gap between them.\n",
    "If the model is unable to reduce the training error,\n",
    "that could mean that our model is too simple\n",
    "(i.e., insufficiently expressive)\n",
    "to capture the pattern that we are trying to model.\n",
    "Moreover, since the *generalization gap* ($R_\\textrm{emp} - R$)\n",
    "between our training and generalization errors is small,\n",
    "we have reason to believe that we could get away with a more complex model.\n",
    "This phenomenon is known as *underfitting*.\n",
    "\n",
    "On the other hand, as we discussed above,\n",
    "we want to watch out for the cases\n",
    "when our training error is significantly lower\n",
    "than our validation error, indicating severe *overfitting*.\n",
    "Note that overfitting is not always a bad thing.\n",
    "In deep learning especially,\n",
    "the best predictive models often perform\n",
    "far better on training data than on holdout data.\n",
    "Ultimately, we usually care about\n",
    "driving the generalization error lower,\n",
    "and only care about the gap insofar\n",
    "as it becomes an obstacle to that end.\n",
    "Note that if the training error is zero,\n",
    "then the generalization gap is precisely equal to the generalization error\n",
    "and we can make progress only by reducing the gap.\n",
    "\n",
    "### Polynomial Curve Fitting\n",
    ":label:`subsec_polynomial-curve-fitting`\n",
    "\n",
    "To illustrate some classical intuition\n",
    "about overfitting and model complexity,\n",
    "consider the following:\n",
    "given training data consisting of a single feature $x$\n",
    "and a corresponding real-valued label $y$,\n",
    "we try to find the polynomial of degree $d$\n",
    "\n",
    "$$\\hat{y}= \\sum_{i=0}^d x^i w_i$$\n",
    "\n",
    "for estimating the label $y$.\n",
    "This is just a linear regression problem\n",
    "where our features are given by the powers of $x$,\n",
    "the model's weights are given by $w_i$,\n",
    "and the bias is given by $w_0$ since $x^0 = 1$ for all $x$.\n",
    "Since this is just a linear regression problem,\n",
    "we can use the squared error as our loss function.\n",
    "\n",
    "\n",
    "A higher-order polynomial function is more complex\n",
    "than a lower-order polynomial function,\n",
    "since the higher-order polynomial has more parameters\n",
    "and the model function's selection range is wider.\n",
    "Fixing the training dataset,\n",
    "higher-order polynomial functions should always\n",
    "achieve lower (at worst, equal) training error\n",
    "relative to lower-degree polynomials.\n",
    "In fact, whenever each data example\n",
    "has a distinct value of $x$,\n",
    "a polynomial function with degree\n",
    "equal to the number of data examples\n",
    "can fit the training set perfectly.\n",
    "We compare the relationship between polynomial degree (model complexity)\n",
    "and both underfitting and overfitting in :numref:`fig_capacity_vs_error`.\n",
    "\n",
    "![Influence of model complexity on underfitting and overfitting.](../img/capacity-vs-error.svg)\n",
    ":label:`fig_capacity_vs_error`\n",
    "\n",
    "\n",
    "### Dataset Size\n",
    "\n",
    "As the above bound already indicates,\n",
    "another big consideration\n",
    "to bear in mind is dataset size.\n",
    "Fixing our model, the fewer samples\n",
    "we have in the training dataset,\n",
    "the more likely (and more severely)\n",
    "we are to encounter overfitting.\n",
    "As we increase the amount of training data,\n",
    "the generalization error typically decreases.\n",
    "Moreover, in general, more data never hurts.\n",
    "For a fixed task and data distribution,\n",
    "model complexity should not increase\n",
    "more rapidly than the amount of data.\n",
    "Given more data, we might  attempt\n",
    "to fit a more complex model.\n",
    "Absent sufficient data, simpler models\n",
    "may be more difficult to beat.\n",
    "For many tasks, deep learning\n",
    "only outperforms linear models\n",
    "when many thousands of training examples are available.\n",
    "In part, the current success of deep learning\n",
    "owes considerably to the abundance of massive datasets\n",
    "arising from Internet companies, cheap storage,\n",
    "connected devices, and the broad digitization of the economy.\n",
    "\n",
    "## Model Selection\n",
    ":label:`subsec_generalization-model-selection`\n",
    "\n",
    "Typically, we select our final model\n",
    "only after evaluating multiple models\n",
    "that differ in various ways\n",
    "(different architectures, training objectives,\n",
    "selected features, data preprocessing,\n",
    "learning rates, etc.).\n",
    "Choosing among many models is aptly\n",
    "called *model selection*.\n",
    "\n",
    "In principle, we should not touch our test set\n",
    "until after we have chosen all our hyperparameters.\n",
    "Were we to use the test data in the model selection process,\n",
    "there is a risk that we might overfit the test data.\n",
    "Then we would be in serious trouble.\n",
    "If we overfit our training data,\n",
    "there is always the evaluation on test data to keep us honest.\n",
    "But if we overfit the test data, how would we ever know?\n",
    "See :citet:`ong2005learning` for an example of how\n",
    "this can lead to absurd results even for models where the complexity\n",
    "can be tightly controlled.\n",
    "\n",
    "Thus, we should never rely on the test data for model selection.\n",
    "And yet we cannot rely solely on the training data\n",
    "for model selection either because\n",
    "we cannot estimate the generalization error\n",
    "on the very data that we use to train the model.\n",
    "\n",
    "\n",
    "In practical applications, the picture gets muddier.\n",
    "While ideally we would only touch the test data once,\n",
    "to assess the very best model or to compare\n",
    "a small number of models with each other,\n",
    "real-world test data is seldom discarded after just one use.\n",
    "We can seldom afford a new test set for each round of experiments.\n",
    "In fact, recycling benchmark data for decades\n",
    "can have a significant impact on the\n",
    "development of algorithms,\n",
    "e.g., for [image classification](https://paperswithcode.com/sota/image-classification-on-imagenet)\n",
    "and [optical character recognition](https://paperswithcode.com/sota/image-classification-on-mnist).\n",
    "\n",
    "The common practice for addressing the problem of *training on the test set*\n",
    "is to split our data three ways,\n",
    "incorporating a *validation set*\n",
    "in addition to the training and test datasets.\n",
    "The result is a murky business where the boundaries\n",
    "between validation and test data are worryingly ambiguous.\n",
    "Unless explicitly stated otherwise, in the experiments in this book\n",
    "we are really working with what should rightly be called\n",
    "training data and validation data, with no true test sets.\n",
    "Therefore, the accuracy reported in each experiment of the book is really\n",
    "the validation accuracy and not a true test set accuracy.\n",
    "\n",
    "### Cross-Validation\n",
    "\n",
    "When training data is scarce,\n",
    "we might not even be able to afford to hold out\n",
    "enough data to constitute a proper validation set.\n",
    "One popular solution to this problem is to employ\n",
    "$K$*-fold cross-validation*.\n",
    "Here, the original training data is split into $K$ non-overlapping subsets.\n",
    "Then model training and validation are executed $K$ times,\n",
    "each time training on $K-1$ subsets and validating\n",
    "on a different subset (the one not used for training in that round).\n",
    "Finally, the training and validation errors are estimated\n",
    "by averaging over the results from the $K$ experiments.\n",
    "\n",
    "\n",
    "\n",
    "## Summary\n",
    "\n",
    "This section explored some of the  underpinnings\n",
    "of generalization in  machine learning.\n",
    "Some of these ideas become complicated\n",
    "and counterintuitive when we get to deeper models; here, models are capable of overfitting data badly,\n",
    "and the relevant notions of complexity\n",
    "can be both implicit and counterintuitive\n",
    "(e.g., larger architectures with more parameters\n",
    "generalizing better).\n",
    "We leave you with a few rules of thumb:\n",
    "\n",
    "1. Use validation sets (or $K$*-fold cross-validation*) for model selection;\n",
    "1. More complex models often require more data;\n",
    "1. Relevant notions of complexity include both the number of parameters and the range of values that they are allowed to take;\n",
    "1. Keeping all else equal, more data almost always leads to better generalization;\n",
    "1. This entire talk of generalization is all predicated on the IID assumption. If we relax this assumption, allowing for distributions to shift between the train and testing periods, then we cannot say anything about generalization absent a further (perhaps milder) assumption.\n",
    "\n",
    "\n",
    "## Exercises\n",
    "\n",
    "1. When can you solve the problem of polynomial regression exactly?\n",
    "1. Give at least five examples where dependent random variables make treating the problem as IID data inadvisable.\n",
    "1. Can you ever expect to see zero training error? Under which circumstances would you see zero generalization error?\n",
    "1. Why is $K$-fold cross-validation very expensive to compute?\n",
    "1. Why is the $K$-fold cross-validation error estimate biased?\n",
    "1. The VC dimension is defined as the maximum number of points that can be classified with arbitrary labels $\\{\\pm 1\\}$ by a function of a class of functions. Why might this not be a good idea for measuring how complex the class of functions is? Hint: consider the magnitude of the functions.\n",
    "1. Your manager gives you a difficult dataset on which your current algorithm does not perform so well. How would you justify to him that you need more data? Hint: you cannot increase the data but you can decrease it.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "531e427e",
   "metadata": {
    "origin_pos": 2,
    "tab": [
     "pytorch"
    ]
   },
   "source": [
    "[Discussions](https://discuss.d2l.ai/t/97)\n"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  },
  "required_libs": []
 },
 "nbformat": 4,
 "nbformat_minor": 5
}